Machine learning is a combination of statistics and computer science that is often defined as an algorithm, computer, or other machine that can learn on its own as it is given more data and without human input. This definition easily becomes conflated with opaque ideas of “artificial intelligence” and is too frequently associated with romanticized technological advancements in transportation, robotics, cognition, and human-computer interaction - although it does apply to these fields.
However, we thankfully do not need to think of machine learning in such complicated terms. Instead, it can be thought of as a toolbox for exploring data for application to research problems in virtually all fields of study. We can think of machine learning as a
“vast set of tools for understanding data. These tools can be classified as supervised or unsupervised. Broadly speaking, supervised statistical learning involves building a statistical model for predicting, or estimating, an output based on one or more inputs… With unsupervised statistical learning, there are inputs but no supervising output; nevertheless we can learn relationships and structure from such data.” (James et al. 2021, p. 1).
Machine learning tasks are generally defined as belonging to one of a small handful of groups: supervised, unsupervised, deep, semi-supervised, reinforcement, and targeted/causal.
Our focus in this workshop is on supervised learning for classification using the SuperLearner R package and we will follow the Guide to SuperLearner.
“We wish to fit a model that relates the response to the predictors, with the aim of accurately predicting the response for future observations (prediction) or better understanding the relationship between the response and the predictors (inference).” (James et al. 2021, p. 26)
Classification: is the supervised task when the Y variable is discrete/categorical.
Regression: is the supervised task when the Y variable is continuous (i.e., numeric or integer).
\(Y ~ X1 + X2 + X3 ... + Xn\)
Simply put, we want to use the X variables to predict Y.
X is/are the independent variable(s) that we use to do the predicting. These are also referred to as features, covariates, predictors, and explanatory/input variables.
Y is the dependent variable and the one we want to predict. It is also referred to as the outcome, target, or response variable. Although predicting the label itself might be convenient, predicting the class probabilities is more efficient.
NOTE: a validation set is also sometimes used for hyperparameter tuning/model selection on the training dataset.
The training set: generally consists of the majority portion of the original dataset (70%, for example) where the model can learn the relationships between the X and Y variables.
The test set: consists of the remaining portion of the dataset (30% in this example) that the trained model will then try to predict without seeing the Y labels.
k-fold cross-validation: is a preferred method for approaching the data splitting process because it repeats the train/test split process “k” number of times and rotates portions of the dataset to ensure that each observation is in the test set at least once.
We will focus on two classification metrics in this workshop: 1. Risk - as measured by mean squared error 2. AUC-ROC - area under the curve - receiver operator characteristic; a measure of true positive rate versus true negative rate
A model is underfit if it cannot learn the relationships between the X and Y variables on the training set.
A model is overfit if it adequately learns the relationships between the X and Y variables and performs well on the training set, but performs poorly on the test set.
We investigate how well individual algorithms and a SuperLearner ensemble weighted average can predict heart disease (yes/no) using other health indicators as features/predictors. Learn more about the dataset at the UCI Machine Learning Repository
SuperLearner is an algorithm that uses cross-validation to estimate the performance of multiple machine learning models, or the same model with different settings. It then creates an optimal weighted average of those models, aka an “ensemble”, using the test data performance. This approach has been proven to be asymptotically as accurate as the best possible prediction algorithm that is tested. Guide to SuperLearner - 1 Background
In this manner, a SuperLearner ensemble is a powerful tool because it:
Eliminates bias of single algorithm selection for framing a research problem
Allows for comparison of multiple algorithms, and/or comparison of the same model but tuned in many different ways
Utilizes a second-level algorithm that produces an ideal weighted prediction that is suitable for data of virtually all distributions and uses external cross-validation to prevent overfitting.
Read the papers:
Install and library the packages we will use in this workshop. The pacman package management tool handles everything for you in the code chunk below.
You should consider using it (or something like it) for your own research. Read the documentation here.
if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
pacman::p_load("caret", # create stratified random split of the dataset
"ck37r", # Chris Kennedy's Machine Learning Helper Toolkit
"ggplot2", # visualize risk and ROC
"glmnet", # the elastic net algorithm
"ranger", # random forest algorithm
"rpart", # decision tree algorithm
"scales", # for scaling data
"SuperLearner", # fit individual algorithms and ensembles
"ROCR", # compute AUC-ROC performance
"xgboost") # boosted tree algorithm
If the CRAN installation for the ck37r package does not work correctly, install it from GitHub by unhashtagging the three lines of code below.
# install.packages("remotes")
# remotes::install_github("ck37/ck37r")
# library(ck37r)
Import and preprocess the data in six steps, and create a new variable for each step:
Import the dataset: The variable raw contains data from the raw .csv file.
Convert categorical variables: raw_fac is used to convert categorical variables to factor type.
Convert factors to numeric indicators: raw_df will be for conversion of factors to numeric indicators.
Identify and remove variables to be scaled: removed consists of the preprocessed data excluding the continuous variables to be scaled.
Scale the continuous variables: rescaled is the variable for the scaled variables.
Produce clean dataset: clean is the final merged dataset; a combination of the variables removed and rescaled.
Save the clean dataset: using the save function. You can load it with the load function so you do not have to repeat these preprocessing steps again.
Read in the raw .csv file and save it in a variable named raw.
raw <- read.csv("data/raw/heart.csv")
str(raw)
## 'data.frame': 303 obs. of 14 variables:
## $ age : int 63 37 41 56 57 57 56 44 52 57 ...
## $ sex : int 1 1 0 1 0 1 0 1 1 1 ...
## $ cp : int 3 2 1 1 0 0 1 1 2 2 ...
## $ trestbps: int 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : int 233 250 204 236 354 192 294 263 199 168 ...
## $ fbs : int 1 0 0 0 0 0 0 0 1 0 ...
## $ restecg : int 0 1 0 1 1 1 0 1 1 1 ...
## $ thalach : int 150 187 172 178 163 148 153 173 162 174 ...
## $ exang : int 0 0 0 0 1 0 0 0 0 0 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ slope : int 0 0 2 2 2 1 1 2 2 2 ...
## $ ca : int 0 0 0 0 0 0 0 0 0 0 ...
## $ thal : int 1 2 2 2 2 1 2 3 3 2 ...
## $ target : int 1 1 1 1 1 1 1 1 1 1 ...
The variables cp, restecg, slope, ca, and thal are variables to be converted to nominal categorical type. Perhaps ordinal factors could be even more appropriate.
Save this factorized version in a variable named raw_fac.
raw_fac <- ck37r::categoricals_to_factors(data = raw,
categoricals = c("cp", "restecg", "slope", "ca", "thal"))
str(raw_fac)
## 'data.frame': 303 obs. of 14 variables:
## $ age : int 63 37 41 56 57 57 56 44 52 57 ...
## $ sex : int 1 1 0 1 0 1 0 1 1 1 ...
## $ cp : Factor w/ 4 levels "0","1","2","3": 4 3 2 2 1 1 2 2 3 3 ...
## $ trestbps: int 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : int 233 250 204 236 354 192 294 263 199 168 ...
## $ fbs : int 1 0 0 0 0 0 0 0 1 0 ...
## $ restecg : Factor w/ 3 levels "0","1","2": 1 2 1 2 2 2 1 2 2 2 ...
## $ thalach : int 150 187 172 178 163 148 153 173 162 174 ...
## $ exang : int 0 0 0 0 1 0 0 0 0 0 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ slope : Factor w/ 3 levels "0","1","2": 1 1 3 3 3 2 2 3 3 3 ...
## $ ca : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ thal : Factor w/ 4 levels "0","1","2","3": 2 3 3 3 3 2 3 4 4 3 ...
## $ target : int 1 1 1 1 1 1 1 1 1 1 ...
Most machine learning algorithms require that input and output variables are numeric and therefore do not handle factor data well (although with some exceptions).
Therefore, we want to convert the factor variables to numeric indicator variables (aka one-hot/dummy coding). See a few examples here.
Name this variable raw_df.
# Save this as an intermediate variable named raw_ind
raw_ind <- ck37r::factors_to_indicators(data = raw_fac,
verbose = TRUE)
## Converting factors (5): cp, restecg, slope, ca, thal
## Converting cp from a factor to a matrix (4 levels).
## : cp_1 cp_2 cp_3
## Converting restecg from a factor to a matrix (3 levels).
## : restecg_1 restecg_2
## Converting slope from a factor to a matrix (3 levels).
## : slope_1 slope_2
## Converting ca from a factor to a matrix (5 levels).
## : ca_1 ca_2 ca_3 ca_4
## Converting thal from a factor to a matrix (4 levels).
## : thal_1 thal_2 thal_3
## Combining factor matrices into a data frame.
# Extract the actual data portion from raw_ind and save as raw_df
raw_df <- raw_ind$data
str(raw_df)
## 'data.frame': 303 obs. of 23 variables:
## $ age : int 63 37 41 56 57 57 56 44 52 57 ...
## $ sex : int 1 1 0 1 0 1 0 1 1 1 ...
## $ trestbps : int 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : int 233 250 204 236 354 192 294 263 199 168 ...
## $ fbs : int 1 0 0 0 0 0 0 0 1 0 ...
## $ thalach : int 150 187 172 178 163 148 153 173 162 174 ...
## $ exang : int 0 0 0 0 1 0 0 0 0 0 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ target : int 1 1 1 1 1 1 1 1 1 1 ...
## $ cp_1 : int 0 0 1 1 0 0 1 1 0 0 ...
## $ cp_2 : int 0 1 0 0 0 0 0 0 1 1 ...
## $ cp_3 : int 1 0 0 0 0 0 0 0 0 0 ...
## $ restecg_1: int 0 1 0 1 1 1 0 1 1 1 ...
## $ restecg_2: int 0 0 0 0 0 0 0 0 0 0 ...
## $ slope_1 : int 0 0 0 0 0 1 1 0 0 0 ...
## $ slope_2 : int 0 0 1 1 1 0 0 1 1 1 ...
## $ ca_1 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ca_2 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ca_3 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ca_4 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ thal_1 : int 1 0 0 0 0 1 0 0 0 0 ...
## $ thal_2 : int 0 1 1 1 1 0 1 0 0 1 ...
## $ thal_3 : int 0 0 0 0 0 0 0 1 1 0 ...
Identify and remove the variables age, trestbps, chol, thalach, and oldpeak to be scaled.
# First, investigate the data to identify variables to scale
names(raw_df)
## [1] "age" "sex" "trestbps" "chol" "fbs" "thalach"
## [7] "exang" "oldpeak" "target" "cp_1" "cp_2" "cp_3"
## [13] "restecg_1" "restecg_2" "slope_1" "slope_2" "ca_1" "ca_2"
## [19] "ca_3" "ca_4" "thal_1" "thal_2" "thal_3"
summary(raw_df)
## age sex trestbps chol
## Min. :29.00 Min. :0.0000 Min. : 94.0 Min. :126.0
## 1st Qu.:47.50 1st Qu.:0.0000 1st Qu.:120.0 1st Qu.:211.0
## Median :55.00 Median :1.0000 Median :130.0 Median :240.0
## Mean :54.37 Mean :0.6832 Mean :131.6 Mean :246.3
## 3rd Qu.:61.00 3rd Qu.:1.0000 3rd Qu.:140.0 3rd Qu.:274.5
## Max. :77.00 Max. :1.0000 Max. :200.0 Max. :564.0
## fbs thalach exang oldpeak
## Min. :0.0000 Min. : 71.0 Min. :0.0000 Min. :0.00
## 1st Qu.:0.0000 1st Qu.:133.5 1st Qu.:0.0000 1st Qu.:0.00
## Median :0.0000 Median :153.0 Median :0.0000 Median :0.80
## Mean :0.1485 Mean :149.6 Mean :0.3267 Mean :1.04
## 3rd Qu.:0.0000 3rd Qu.:166.0 3rd Qu.:1.0000 3rd Qu.:1.60
## Max. :1.0000 Max. :202.0 Max. :1.0000 Max. :6.20
## target cp_1 cp_2 cp_3
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :1.0000 Median :0.000 Median :0.0000 Median :0.00000
## Mean :0.5446 Mean :0.165 Mean :0.2871 Mean :0.07591
## 3rd Qu.:1.0000 3rd Qu.:0.000 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.00000
## restecg_1 restecg_2 slope_1 slope_2
## Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000
## Median :1.0000 Median :0.0000 Median :0.000 Median :0.0000
## Mean :0.5017 Mean :0.0132 Mean :0.462 Mean :0.4686
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.000 Max. :1.0000
## ca_1 ca_2 ca_3 ca_4
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :0.2145 Mean :0.1254 Mean :0.06601 Mean :0.0165
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.0000
## thal_1 thal_2 thal_3
## Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.00000 Median :1.0000 Median :0.0000
## Mean :0.05941 Mean :0.5479 Mean :0.3861
## 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.0000 Max. :1.0000
summary(raw_df[,c("age", "trestbps", "chol", "thalach", "oldpeak")])
## age trestbps chol thalach oldpeak
## Min. :29.00 Min. : 94.0 Min. :126.0 Min. : 71.0 Min. :0.00
## 1st Qu.:47.50 1st Qu.:120.0 1st Qu.:211.0 1st Qu.:133.5 1st Qu.:0.00
## Median :55.00 Median :130.0 Median :240.0 Median :153.0 Median :0.80
## Mean :54.37 Mean :131.6 Mean :246.3 Mean :149.6 Mean :1.04
## 3rd Qu.:61.00 3rd Qu.:140.0 3rd Qu.:274.5 3rd Qu.:166.0 3rd Qu.:1.60
## Max. :77.00 Max. :200.0 Max. :564.0 Max. :202.0 Max. :6.20
# Save this as an intermediate variable named to_scale
to_scale <- raw_df[,c("age", "trestbps", "chol", "thalach", "oldpeak")]
Also see
?ck37r::standardizefor a simple extension
Remove the variables to be scaled and save the remaining variables in a dataframe named removed.
removed <- raw_df[,-c(1, # age
3, # trestbps
4, # chol
6, # thalach
8 # oldpeak
)]
Rescale these variables to a rang of 0 to 1 in a variable named rescaled.
rescaled <- as.data.frame(rescale(as.matrix(to_scale), to = c(0,1), ncol = 1))
summary(rescaled)
## age trestbps chol thalach
## Min. :0.05142 Min. :0.1667 Min. :0.2234 Min. :0.1259
## 1st Qu.:0.08422 1st Qu.:0.2128 1st Qu.:0.3741 1st Qu.:0.2367
## Median :0.09752 Median :0.2305 Median :0.4255 Median :0.2713
## Mean :0.09639 Mean :0.2334 Mean :0.4366 Mean :0.2653
## 3rd Qu.:0.10816 3rd Qu.:0.2482 3rd Qu.:0.4867 3rd Qu.:0.2943
## Max. :0.13652 Max. :0.3546 Max. :1.0000 Max. :0.3582
## oldpeak
## Min. :0.000000
## 1st Qu.:0.000000
## Median :0.001418
## Mean :0.001843
## 3rd Qu.:0.002837
## Max. :0.010993
Finally, recombine the original data with the scaled variables for the final clean dataset.
clean <- cbind(removed, # cleaned data without the scaled variables
rescaled # scaled variables
)
# The scaled variables are the last five!
str(clean)
## 'data.frame': 303 obs. of 23 variables:
## $ sex : int 1 1 0 1 0 1 0 1 1 1 ...
## $ fbs : int 1 0 0 0 0 0 0 0 1 0 ...
## $ exang : int 0 0 0 0 1 0 0 0 0 0 ...
## $ target : int 1 1 1 1 1 1 1 1 1 1 ...
## $ cp_1 : int 0 0 1 1 0 0 1 1 0 0 ...
## $ cp_2 : int 0 1 0 0 0 0 0 0 1 1 ...
## $ cp_3 : int 1 0 0 0 0 0 0 0 0 0 ...
## $ restecg_1: int 0 1 0 1 1 1 0 1 1 1 ...
## $ restecg_2: int 0 0 0 0 0 0 0 0 0 0 ...
## $ slope_1 : int 0 0 0 0 0 1 1 0 0 0 ...
## $ slope_2 : int 0 0 1 1 1 0 0 1 1 1 ...
## $ ca_1 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ca_2 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ca_3 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ca_4 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ thal_1 : int 1 0 0 0 0 1 0 0 0 0 ...
## $ thal_2 : int 0 1 1 1 1 0 1 0 0 1 ...
## $ thal_3 : int 0 0 0 0 0 0 0 1 1 0 ...
## $ age : num 0.1117 0.0656 0.0727 0.0993 0.1011 ...
## $ trestbps : num 0.257 0.23 0.23 0.213 0.213 ...
## $ chol : num 0.413 0.443 0.362 0.418 0.628 ...
## $ thalach : num 0.266 0.332 0.305 0.316 0.289 ...
## $ oldpeak : num 0.00408 0.00621 0.00248 0.00142 0.00106 ...
clean datasetSave the clean dataset so you do not have to repeat these preprocessing steps.
save(clean, # variable(s) to be saved
file = "data/preprocessed/clean.RData") # file name
# You can always load the clean dataset with
load("data/preprocessed/clean.RData")
One handy way to keep your machine learning organized is by storing all the components in a list. Let’s call our list heart_class, for our task of heart disease classification.
Our first two list components will be the dataset and outcome variable. In the heart dataset, the target variable is coded with a 1 if the patient had heart disease and a 0 if they did not.
heart_class <- list(
data = clean,
outcome = "target"
)
# View the list contents
names(heart_class)
## [1] "data" "outcome"
# Access part of the list with the dollar sign
head(heart_class$data)
## sex fbs exang target cp_1 cp_2 cp_3 restecg_1 restecg_2 slope_1 slope_2 ca_1
## 1 1 1 0 1 0 0 1 0 0 0 0 0
## 2 1 0 0 1 0 1 0 1 0 0 0 0
## 3 0 0 0 1 1 0 0 0 0 0 1 0
## 4 1 0 0 1 1 0 0 1 0 0 1 0
## 5 0 0 1 1 0 0 0 1 0 0 1 0
## 6 1 0 0 1 0 0 0 1 0 1 0 0
## ca_2 ca_3 ca_4 thal_1 thal_2 thal_3 age trestbps chol thalach
## 1 0 0 0 1 0 0 0.11170213 0.2570922 0.4131206 0.2659574
## 2 0 0 0 0 1 0 0.06560284 0.2304965 0.4432624 0.3315603
## 3 0 0 0 0 1 0 0.07269504 0.2304965 0.3617021 0.3049645
## 4 0 0 0 0 1 0 0.09929078 0.2127660 0.4184397 0.3156028
## 5 0 0 0 0 1 0 0.10106383 0.2127660 0.6276596 0.2890071
## 6 0 0 0 1 0 0 0.10106383 0.2482270 0.3404255 0.2624113
## oldpeak
## 1 0.0040780142
## 2 0.0062056738
## 3 0.0024822695
## 4 0.0014184397
## 5 0.0010638298
## 6 0.0007092199
heart_class$outcome
## [1] "target"
heart_class$covariates <- setdiff(names(heart_class$data), heart_class$outcome)
# The covariate names appear in the heart_class list!
names(heart_class)
## [1] "data" "outcome" "covariates"
# Call the covariates with (notice that "target" is excluded)
heart_class$covariates
## [1] "sex" "fbs" "exang" "cp_1" "cp_2" "cp_3"
## [7] "restecg_1" "restecg_2" "slope_1" "slope_2" "ca_1" "ca_2"
## [13] "ca_3" "ca_4" "thal_1" "thal_2" "thal_3" "age"
## [19] "trestbps" "chol" "thalach" "oldpeak"
set.seed(1)
training_rows <-
caret::createDataPartition(heart_class$data[[heart_class$outcome]],
p = 0.70,
list = FALSE)
heart_class$train_rows <- training_rows
# We have added the training rows to our list
names(heart_class)
## [1] "data" "outcome" "covariates" "train_rows"
head(heart_class$train_rows)
## Resample1
## [1,] 1
## [2,] 2
## [3,] 4
## [4,] 5
## [5,] 6
## [6,] 9
x_train <- heart_class$data[heart_class$train_rows, heart_class$covariates]
y_train <- heart_class$data[heart_class$train_rows, heart_class$outcome]
x_test <- heart_class$data[-heart_class$train_rows, heart_class$covariates]
y_test <- heart_class$data[-heart_class$train_rows, heart_class$outcome]
# Mean and frequencies of training set
mean(heart_class$data[training_rows, "target"])
## [1] 0.5446009
table(heart_class$data[training_rows, "target"])
##
## 0 1
## 97 116
# Mean and frequencies of test set
mean(heart_class$data[-training_rows, "target"])
## [1] 0.5444444
table(heart_class$data[-training_rows, "target"])
##
## 0 1
## 41 49
# Add these variables to our list for safekeeping
heart_class$x_train <- x_train
heart_class$y_train <- y_train
heart_class$x_test <- x_test
heart_class$y_test <- y_test
names(heart_class)
## [1] "data" "outcome" "covariates" "train_rows" "x_train"
## [6] "y_train" "x_test" "y_test"
# For example,
heart_class$y_train
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [149] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [186] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# Also save X and Y for the ensemble
heart_class$Y <- clean$target
heart_class$X <- clean[,-4]
heart_classThis way, you do not have to repeat all of these steps. You can just load("data/preprocessed/heart_class.RData")
names(heart_class)
## [1] "data" "outcome" "covariates" "train_rows" "x_train"
## [6] "y_train" "x_test" "y_test" "Y" "X"
save(heart_class,
file = "data/preprocessed/heart_class.RData")
# Load it with the load() function
load("data/preprocessed/heart_class.RData")
It is often good to fit a single model to prototype your machine learning goals.
Let’s fit three! Click the links for more information.
SuperLearner supports a wide variety of useful algorithms:
listWrappers()
## All prediction algorithm wrappers in SuperLearner:
## [1] "SL.bartMachine" "SL.bayesglm" "SL.biglasso"
## [4] "SL.caret" "SL.caret.rpart" "SL.cforest"
## [7] "SL.earth" "SL.extraTrees" "SL.gam"
## [10] "SL.gbm" "SL.glm" "SL.glm.interaction"
## [13] "SL.glmnet" "SL.ipredbagg" "SL.kernelKnn"
## [16] "SL.knn" "SL.ksvm" "SL.lda"
## [19] "SL.leekasso" "SL.lm" "SL.loess"
## [22] "SL.logreg" "SL.mean" "SL.nnet"
## [25] "SL.nnls" "SL.polymars" "SL.qda"
## [28] "SL.randomForest" "SL.ranger" "SL.ridge"
## [31] "SL.rpart" "SL.rpartPrune" "SL.speedglm"
## [34] "SL.speedlm" "SL.step" "SL.step.forward"
## [37] "SL.step.interaction" "SL.stepAIC" "SL.svm"
## [40] "SL.template" "SL.xgboost"
##
## All screening algorithm wrappers in SuperLearner:
## [1] "All"
## [1] "screen.corP" "screen.corRank" "screen.glmnet"
## [4] "screen.randomForest" "screen.SIS" "screen.template"
## [7] "screen.ttest" "write.screen.template"
A decision tree is an algorithm is useful because they are relatively simple to construct and interpret. They can work with categorical data directly, without the need for one-hot encoding.
Decision trees work by recursively partitioning, or partitioning, the feature space into smaller regions that contain less data. Splitting takes place at nodes and each directional split is called a branch. The top node is called the root node and seeks to find the most discriminating split, thus always putting the most optimal partition first.
Keep in mind that single decision trees are prone to overfitting and high variance.
The tree stops partitioning when it fails to meet some hyperparameter condition. See ?rpart.control for a list of tunings.
Unhashtag the below code to reproduce the tree figure for the training data.
Check out this tutorial for a step-by-step decision tree walkthrough
Read this stackoverflow thread about decision tree information gain.
?rpart.control
# install.packages("rpart.plot")
# library(rpart.plot)
# (dt_example_plot <- rpart(y_train ~ .,
# data = x_train,
# method = "class",
# parms = list(split = "information")))
# rpart.plot(dt_example_plot)
training set decision tree plot
set.seed(1)
dt <- SuperLearner(Y = heart_class$y_train,
X = heart_class$x_train,
family = binomial(),
SL.library = "SL.rpart")
dt
##
## Call:
## SuperLearner(Y = heart_class$y_train, X = heart_class$x_train, family = binomial(),
## SL.library = "SL.rpart")
##
##
## Risk Coef
## SL.rpart_All 0.1743429 1
# Accuracy
(dt_acc <- 1 - dt$cvRisk)
## SL.rpart_All
## 0.8256571
# Predict on test set
dt_predictions = predict(dt, heart_class$x_test, onlySL = TRUE)
str(dt_predictions)
## List of 2
## $ pred : num [1:90, 1] 0.917 0.917 0.652 0.917 0.917 ...
## $ library.predict: num [1:90, 1] 0.917 0.917 0.652 0.917 0.917 ...
summary(dt_predictions$library.predict)
## V1
## Min. :0.09091
## 1st Qu.:0.11111
## Median :0.65217
## Mean :0.59937
## 3rd Qu.:0.91667
## Max. :0.91667
# Visualize predicted probabilities
qplot(dt_predictions$pred[, 1]) + theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Compute AUC-ROC and save as variable
dt_rocr_score = ROCR::prediction(dt_predictions$pred, heart_class$y_test)
dt_auc_roc = ROCR::performance(dt_rocr_score, measure = "auc", x.measure = "cutoff")@y.values[[1]]
# AUC-ROC
dt_auc_roc
## [1] 0.7366849
The random forest algorithm is a big step up from a single decision tree.
It is a forest because it grows multiple decision trees.
It is random because it bootstrap samples (with replacement) two-thirds of the dataset for each tree (this is called bootstrap aggregating, or “bagging”) and evaluates performance on the remaining one-third of the data (called the “out-of-bag” sample).
It is also random because instead of trying to place the most discriminating split as the root node, it selects a random sample of features to try at each split. This allows for good splits to be followed by bad splits, and for bad splits to be followed by good splits thus helping decorrelate the average performance estimator for a more robust conclusion.
See ?ranger for hyperparameter tuning options.
?ranger
set.seed(1)
rf <- SuperLearner(Y = heart_class$y_train,
X = heart_class$x_train,
family = binomial(),
SL.library = "SL.ranger")
rf
##
## Call:
## SuperLearner(Y = heart_class$y_train, X = heart_class$x_train, family = binomial(),
## SL.library = "SL.ranger")
##
##
## Risk Coef
## SL.ranger_All 0.1293912 1
# Accuracy
(rf_acc <- 1 - rf$cvRisk)
## SL.ranger_All
## 0.8706088
rf_predictions = predict(rf, heart_class$x_test, onlySL = TRUE)
str(rf_predictions)
## List of 2
## $ pred : num [1:90, 1] 0.97 0.762 0.566 0.901 0.941 ...
## $ library.predict: num [1:90, 1] 0.97 0.762 0.566 0.901 0.941 ...
summary(rf_predictions$library.predict)
## V1
## Min. :0.008188
## 1st Qu.:0.337193
## Median :0.619751
## Mean :0.592442
## 3rd Qu.:0.874289
## Max. :0.992931
# Visualize predicted probabilities
qplot(rf_predictions$pred[, 1]) + theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Compute AUC-ROC and save as variable
rf_rocr_score = ROCR::prediction(rf_predictions$pred, heart_class$y_test)
rf_auc_roc = ROCR::performance(rf_rocr_score, measure = "auc", x.measure = "cutoff")@y.values[[1]]
# AUC-ROC
rf_auc_roc
## [1] 0.8596317
The lasso (least absolute square shrinkage operator) is a form of penalized regression and classification that shrinks the beta coefficients to zero of features that are not related to the outcome variable.
We can follow the “one standard error rule”, which allows us to select a higher lambda value (the regularization parameter) within one standard error of the minimum value and sacrifice a little error for a model that contains less variables but is ostensibly easier to interpret.
View the help files with ?glmnet and ?cv.glmnet to learn more.
?glmnet
?cv.glmnet
set.seed(1)
lasso <- cv.glmnet(x = as.matrix(heart_class$X),
y = heart_class$Y,
family = "binomial",
alpha = 1)
lasso
##
## Call: cv.glmnet(x = as.matrix(heart_class$X), y = heart_class$Y, family = "binomial", alpha = 1)
##
## Measure: Binomial Deviance
##
## Lambda Index Measure SE Nonzero
## min 0.01012 36 0.7631 0.06686 18
## 1se 0.03392 23 0.8285 0.04659 15
# View path plot
plot(lasso)
# View coefficients
plot(lasso$glmnet.fit, xvar = "lambda", label = TRUE)
# Show coefficients for 1se model
(coef_min = coef(lasso, s = "lambda.1se"))
## 23 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -0.5608427
## sex -0.4290884
## fbs .
## exang -0.6548822
## cp_1 0.1442182
## cp_2 0.6982435
## cp_3 0.3563876
## restecg_1 0.1032213
## restecg_2 .
## slope_1 -0.3557496
## slope_2 0.1325356
## ca_1 -0.8761617
## ca_2 -1.2264879
## ca_3 -0.7966503
## ca_4 .
## thal_1 .
## thal_2 0.6554511
## thal_3 -0.5795217
## age .
## trestbps .
## chol .
## thalach 6.1897480
## oldpeak -165.2682982
set.seed(1)
la <- SuperLearner(Y = heart_class$y_train,
X = heart_class$x_train,
family = binomial(),
SL.library = "SL.glmnet")
la
##
## Call:
## SuperLearner(Y = heart_class$y_train, X = heart_class$x_train, family = binomial(),
## SL.library = "SL.glmnet")
##
##
## Risk Coef
## SL.glmnet_All 0.1178595 1
# 0.86 accuracy
(la_risk <- 1-la$cvRisk)
## SL.glmnet_All
## 0.8821405
la_predictions = predict(la, heart_class$x_test, onlySL = TRUE)
str(la_predictions)
## List of 2
## $ pred : num [1:90, 1] 0.916 0.74 0.749 0.859 0.965 ...
## $ library.predict: num [1:90, 1] 0.916 0.74 0.749 0.859 0.965 ...
summary(la_predictions$library.predict)
## V1
## Min. :0.008801
## 1st Qu.:0.299391
## Median :0.633026
## Mean :0.580456
## 3rd Qu.:0.903609
## Max. :0.988807
# Visualize predicted probabilities
qplot(la_predictions$pred[, 1]) + theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Compute AUC-ROC and save as variable
la_rocr_score = ROCR::prediction(la_predictions$pred, heart_class$y_test)
la_auc_roc = ROCR::performance(la_rocr_score, measure = "auc", x.measure = "cutoff")@y.values[[1]]
# AUC-ROC
la_auc_roc
## [1] 0.8939771
Now, lets fit an ensemble with two more algorithms include:
set.seed(1)
cv_sl <- CV.SuperLearner(Y = heart_class$Y, # heart disease yes/no
X = heart_class$X, # excluding the "target" variable
family = binomial(),
# For your own research, 5, 10, or 20 are good options
cvControl = list(V = 5),
innerCvControl = list(list(V = 5)),
verbose = FALSE,
method = "method.NNLS",
SL.library = c("SL.rpart", # decision tree
"SL.ranger", # random forest
"SL.glmnet", # lasso
"SL.xgboost", # boosted trees
"SL.mean")) # Y mean
## Warning in CV.SuperLearner(Y = heart_class$Y, X = heart_class$X, family =
## binomial(), : Only a single innerCvControl is given, will be replicated across
## all cross-validation split calls to SuperLearner
# Learn more about what is contained within the model
cv_sl
##
## Call:
## CV.SuperLearner(Y = heart_class$Y, X = heart_class$X, family = binomial(),
## SL.library = c("SL.rpart", "SL.ranger", "SL.glmnet", "SL.xgboost", "SL.mean"),
## method = "method.NNLS", verbose = FALSE, cvControl = list(V = 5), innerCvControl = list(list(V = 5)))
##
##
##
## Cross-validated predictions from the SuperLearner: SL.predict
##
## Cross-validated predictions from the discrete super learner (cross-validation selector): discreteSL.predict
##
## Which library algorithm was the discrete super learner: whichDiscreteSL
##
## Cross-validated prediction for all algorithms in the library: library.predict
# View the risk table
summary(cv_sl)
##
## Call:
## CV.SuperLearner(Y = heart_class$Y, X = heart_class$X, family = binomial(),
## SL.library = c("SL.rpart", "SL.ranger", "SL.glmnet", "SL.xgboost", "SL.mean"),
## method = "method.NNLS", verbose = FALSE, cvControl = list(V = 5), innerCvControl = list(list(V = 5)))
##
##
## Risk is based on: Mean Squared Error
##
## All risk estimates are based on V = 5
##
## Algorithm Ave se Min Max
## Super Learner 0.11722 0.0113943 0.088134 0.15460
## Discrete SL 0.11607 0.0116867 0.084784 0.15543
## SL.rpart_All 0.18315 0.0167049 0.142558 0.23328
## SL.ranger_All 0.13734 0.0105342 0.122008 0.16847
## SL.glmnet_All 0.11607 0.0116867 0.084784 0.15543
## SL.xgboost_All 0.14758 0.0139655 0.110688 0.19415
## SL.mean_All 0.24944 0.0026639 0.244474 0.25555
# View the discrete winner
table(simplify2array(cv_sl$whichDiscreteSL))
##
## SL.glmnet_All
## 5
# View the AUC-ROC table for:
# 1. The individual algorithms
# 2. The discrete winner
# 3. The SuperLearner ensemble
ck37r::auc_table(cv_sl)
## auc se ci_lower ci_upper p-value
## SL.mean_All 0.5000000 0.05767758 0.3869540 0.6130460 4.666398e-13
## SL.rpart_All 0.7753934 0.03589421 0.7050420 0.8457447 7.213063e-05
## SL.xgboost_All 0.8657012 0.02082595 0.8248831 0.9065194 1.340076e-02
## SL.ranger_All 0.8851278 0.01857296 0.8487255 0.9215302 7.535160e-02
## SuperLearner 0.9100506 0.01667676 0.8773648 0.9427365 4.577962e-01
## SL.glmnet_All 0.9118182 0.01638189 0.8797102 0.9439261 5.000000e-01
## DiscreteSL 0.9118182 0.01638189 0.8797102 0.9439261 5.000000e-01
# Visualize the cross-validated risk
plot.CV.SuperLearner(cv_sl) + theme_bw()
# Print the weights table
cvsl_weights(cv_sl)
## # Learner Mean SD Min Max
## 1 1 glmnet 0.84904 0.08951 0.77596 1.00000
## 2 2 ranger 0.13957 0.08266 0.00000 0.22044
## 3 3 rpart 0.00811 0.01620 0.00000 0.03695
## 4 4 xgboost 0.00328 0.00733 0.00000 0.01638
## 5 5 mean 0.00000 0.00000 0.00000 0.00000
# Compute ensemble AUC-ROC
ck37r::cvsl_auc(cv_sl)
## $cvAUC
## [1] 0.9100506
##
## $se
## [1] 0.01667676
##
## $ci
## [1] 0.8773648 0.9427365
##
## $confidence
## [1] 0.95
# Plot ROC curve
ck37r::plot_roc(cv_sl)
# View rough estimates of variable importance
set.seed(1)
var_importance <- ck37r::vim_corr(covariates = heart_class$covariates,
outcome = heart_class$outcome,
data = clean,
bootse = FALSE,
verbose = TRUE)
var_importance
## rank variable corr p_value p_value_fdr avg_con
## 015 1 thal_2 0.52733355 4.356225e-23 9.583695e-22 0.260869565
## 016 2 thal_3 -0.48611215 2.241269e-19 2.465396e-18 0.644927536
## 02 3 exang -0.43675708 1.520814e-15 1.115263e-14 0.550724638
## 021 4 oldpeak -0.43069600 4.085346e-15 2.246941e-14 0.002811183
## 020 5 thalach 0.42174093 1.697338e-14 7.468286e-14 0.246633775
## 09 6 slope_2 0.39406637 1.068761e-12 3.918791e-12 0.253623188
## 08 7 slope_1 -0.36205330 8.142433e-11 2.559051e-10 0.659420290
## 04 8 cp_2 0.31674216 1.735460e-08 4.772515e-08 0.130434783
## 0 9 sex -0.28093658 6.678692e-07 1.632569e-06 0.826086957
## 011 10 ca_2 -0.27399787 1.280668e-06 2.817470e-06 0.224637681
## 03 11 cp_1 0.24587910 1.498284e-05 2.996568e-05 0.065217391
## 010 12 ca_1 -0.23241224 4.409069e-05 8.083294e-05 0.318840580
## 017 13 age -0.22543872 7.524801e-05 1.273428e-04 0.100357180
## 012 14 ca_3 -0.21061527 2.220536e-04 3.489414e-04 0.123188406
## 06 15 restecg_1 0.17532180 2.191648e-03 3.214418e-03 0.405797101
## 018 16 trestbps -0.14493113 1.154606e-02 1.587583e-02 0.238295303
## 014 17 thal_1 -0.10658897 6.388288e-02 8.267196e-02 0.086956522
## 05 18 cp_3 0.08695687 1.309784e-01 1.600847e-01 0.050724638
## 019 19 chol -0.08523911 1.387903e-01 1.607046e-01 0.445189639
## 07 20 restecg_2 -0.06841024 2.351175e-01 2.586292e-01 0.021739130
## 013 21 ca_4 0.06644102 2.488996e-01 2.607520e-01 0.007246377
## 01 22 fbs -0.02804576 6.267775e-01 6.267775e-01 0.159420290
## avg_case note
## 015 0.787878788
## 016 0.169696970
## 02 0.139393939
## 021 0.001033742
## 020 0.280969267
## 09 0.648484848
## 08 0.296969697
## 04 0.418181818
## 0 0.563636364
## 011 0.042424242
## 03 0.248484848
## 010 0.127272727
## 017 0.093079734
## 012 0.018181818
## 06 0.581818182
## 018 0.229260692
## 014 0.036363636
## 05 0.096969697
## 019 0.429486353
## 07 0.006060606
## 013 0.024242424
## 01 0.139393939
Read Chapter 9 and Chapter 10 in the Guide to SuperLearner to learn to customize model hyperparameters and test algorithms with multiple hyperparameter settings.
Refit the ensemble with a few new, tuned algorithms compared to the existing algorithms contained within the SL.library parameter inside of CV.SuperLearner above (“SL.rpart”, “SL.ranger”, “SL.glmnet”, “SL.xgboost”, “SL.mean”).
What changed, and how were you able to optimize performance?
How did you find out which hyperparameters to tune?
HINT: type
?rpart,?ranger, etc.
caret: https://topepo.github.io/caret/
tidymodels: https://www.tidymodels.org/
mikropml: http://www.schlosslab.org/mikropml/articles/paper.html
(coming soon)